Dataset Information

Our data set is a sample of accidents and related accident data compiled by the National Highway Traffic Safety Administration (NHTSA) for 2018. Specifically, it comes from the Crash Report Sampling System (CRSS), a sample of police-reported accidents covering many different aspects of an accident, including pedestrians, motor vehicles, property, etc. Every year, there are more than 6 million reported accidents, and this data set compiles data on accidents that are “of greatest concern to the highway safety community and the general public”.

Objective

Our objective is to gain deeper insight into accident trends. We focus on a few areas, including how alcohol use relates to other variables, the area of the US in which accidents occur, and others.

General Data About 2018 Collisions

Accidents By Hour Of Day

Accidents By Day Of Week

Alcohol Use

Accidents involving alcohol made up a small percentage of all accidents. However, looking more closely at just the accidents involving alcohol, certain trends became clearer.

It is hard to see the significance of alcohol when comparing against the overall number of accidents in our data. However, some trends emerge: the proportion of accidents involving alcohol increases on Friday and the weekend days, as well as in the evening and overnight hours.

Alcohol Use Vs. Total Accidents By Hour

Alcohol Use Vs. Total Accidents By Day Of Week

What is interesting when looking at just the accidents involving alcohol is that the proportion of alcohol-related accidents increases as the overall number of accidents decreases. As a result, these graphs are roughly the inverse of the overall accident graphs.
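The proportion discussed above is the number of alcohol-involved accidents in each hour divided by all accidents in that hour. A minimal Python sketch of that calculation follows. It is not the code used for this report: the function name `alcohol_share_by_hour` and the toy records are made up for illustration and are not CRSS values.

```python
from collections import Counter

def alcohol_share_by_hour(records):
    """Return {hour: share of that hour's accidents that involved alcohol}."""
    total = Counter()
    alcohol = Counter()
    for hour, involved in records:
        total[hour] += 1
        if involved:
            alcohol[hour] += 1
    return {hour: alcohol[hour] / total[hour] for hour in sorted(total)}

# Toy records of (hour of day, alcohol involved?). Illustrative only, not CRSS data.
toy_records = (
    [(2, True)] * 2 + [(2, False)] * 2      # few accidents overnight, half with alcohol
    + [(17, True)] * 1 + [(17, False)] * 9  # many accidents at rush hour, few with alcohol
)

shares = alcohol_share_by_hour(toy_records)
for hour, share in shares.items():
    print(f"hour {hour:2d}: {share:.0%} of accidents involved alcohol")
```

In this toy data the hour with fewer accidents has the higher alcohol share, which is the same kind of inverse pattern described above.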

Alcohol Use In Accidents By Hour

Alcohol Use In Accidents By Day Of Week

Central Limit Theorem

The Central Limit Theorem (CLT) states that the distribution of the means of samples of a given size drawn from a population will, in most cases, be approximately normal, and that the approximation improves as the sample size increases. We tested this with the age of the driver from the PERSON dataset, and the theorem held: as we increased the sample size, the distribution of sample means became increasingly normal.

## [1] "Population mean:  37.3305331448058  Population SD:  19.0429683609092"
## [1] "Population mean:  37.33  Population SD:  19.04"
## [1] "Sample size:  10 Sample Mean:  37.28 Sample SD 6.03"
## [2] "Sample size:  20 Sample Mean:  37.26 Sample SD 4.25"
## [3] "Sample size:  30 Sample Mean:  37.36 Sample SD 3.45"
## [4] "Sample size:  40 Sample Mean:  37.37 Sample SD 3.05"
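The CLT also predicts that the standard deviation of the sample means is roughly the population SD divided by the square root of the sample size. With the population SD of 19.04 reported above, that gives about 6.02, 4.26, 3.48 and 3.01 for sample sizes 10, 20, 30 and 40, close to the reported 6.03, 4.25, 3.45 and 3.05.

The experiment can be sketched in Python as below. This is an illustration under stated assumptions, not the code that produced the output above: the population is a synthetic, right-skewed stand-in for driver age, and the number of samples per size (2000) and sampling without replacement are assumptions.

```python
import random
import statistics

def sample_mean_stats(population, sample_size, n_samples, rng):
    """Draw n_samples samples and return (mean, SD) of their sample means."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)

rng = random.Random(42)

# Synthetic right-skewed "ages" (assumption: stand-in for the PERSON data).
population = [16 + rng.expovariate(1 / 21) for _ in range(20000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print(f"Population mean: {pop_mean:.2f}  Population SD: {pop_sd:.2f}")

results = {}
for n in (10, 20, 30, 40):
    mean_of_means, sd_of_means = sample_mean_stats(population, n, 2000, rng)
    predicted_sd = pop_sd / n ** 0.5  # CLT prediction: sigma / sqrt(n)
    results[n] = (mean_of_means, sd_of_means, predicted_sd)
    print(
        f"Sample size: {n}  Mean of sample means: {mean_of_means:.2f}  "
        f"SD of sample means: {sd_of_means:.2f}  Predicted SD: {predicted_sd:.2f}"
    )
```

Even though the synthetic population is skewed, the mean of the sample means stays near the population mean while their spread shrinks as the sample size grows.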

Sample Size 10

Sample Size 20

Sample Size 30

Sample Size 40

Severities x Injuries

This pie chart shows how injuries are distributed across the severity levels. The largest category is possible injury, where the injury ranges from none to less than minor. The second largest is minor injury, such as bruises and scrapes. The third is serious injury, which may be life-threatening and require hospital treatment. The least common category is fatal injury, where the injuries result in the victim's death.

Urbanicity

Urbanicity is split into two values, “Urban” and “Rural”. The accident totals show that urban areas have more accidents than rural areas. This is expected, because urban areas carry more traffic than rural areas, especially during rush hours. In the data, the number of rural accidents is only around 30% of the number of urban accidents.

Severities x Urbanicity

Comparing severity with urbanicity, every severity level has more accidents in urban areas than in rural areas. For each severity level, the rural count is only around 10% to 25% of the urban count. Within urban areas, the severity counts follow the same ordering as the overall totals: no injury is highest, followed by possible injury, minor injury, serious injury, and fatal injury.

Sampling

We use three different sampling methods on the injury severity data. The first is simple random sampling without replacement, where every case has the same chance of being selected. The second is systematic sampling, where the sampling interval is the total number of cases across the severities divided by the chosen sample size (1000 in this case). The third is stratified sampling, where the cases are divided into strata by severity and the desired sample size (also 1000) is allocated across the strata in proportion to each stratum's size.

## Stratum 1 
## 
## Population total and number of selected units: 24354 18.39134 
## Stratum 2 
## 
## Population total and number of selected units: 10564 144.5395 
## Stratum 3 
## 
## Population total and number of selected units: 6861 513.0614 
## Stratum 4 
## 
## Population total and number of selected units: 873 222.5499 
## Stratum 5 
## 
## Population total and number of selected units: 4816 101.4578 
## Number of strata  5 
## Total number of selected units 1000
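The three sampling methods described above can be sketched in Python as follows. This is an illustration, not the code used for this report: the function names are hypothetical, the strata carry generic labels because the output does not say which severity each stratum represents, and rounding the proportional allocation by largest remainder is an assumption. Only the stratum population totals are taken from the output above.

```python
import random

def simple_random_sample(units, n, rng):
    """Draw n units without replacement; every unit is equally likely."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Take every k-th unit after a random start, with interval k = N // n."""
    k = len(units) // n
    start = rng.randrange(k)
    return [units[start + i * k] for i in range(n)]

def proportional_allocation(stratum_sizes, n):
    """Split n across strata in proportion to each stratum's population."""
    total = sum(stratum_sizes.values())
    return {name: n * size / total for name, size in stratum_sizes.items()}

def stratified_sample(strata, n, rng):
    """Sample each stratum separately, using proportional allocation."""
    exact = proportional_allocation({s: len(u) for s, u in strata.items()}, n)
    counts = {s: int(x) for s, x in exact.items()}
    # Assumption: leftover units go to the strata with the largest fractional parts.
    leftover = n - sum(counts.values())
    by_fraction = sorted(exact, key=lambda s: exact[s] - counts[s], reverse=True)
    for s in by_fraction[:leftover]:
        counts[s] += 1
    return {s: rng.sample(units, counts[s]) for s, units in strata.items()}

# Stratum population totals as printed in the output above.
stratum_sizes = {
    "Stratum 1": 24354,
    "Stratum 2": 10564,
    "Stratum 3": 6861,
    "Stratum 4": 873,
    "Stratum 5": 4816,
}
# Placeholder units: (stratum label, index) pairs standing in for real cases.
strata = {s: [(s, i) for i in range(size)] for s, size in stratum_sizes.items()}
population = [unit for units in strata.values() for unit in units]

rng = random.Random(0)
srs = simple_random_sample(population, 1000, rng)
systematic = systematic_sample(population, 1000, rng)
stratified = stratified_sample(strata, 1000, rng)

allocation = proportional_allocation(stratum_sizes, 1000)
for name, share in allocation.items():
    print(f"{name}: population {stratum_sizes[name]}, "
          f"proportional share {share:.4f}, drawn {len(stratified[name])}")
```

The proportional shares computed here (about 513.06, 222.55, 144.54, 18.39 and 101.46) are the same five fractional values that appear in the output above, although the output appears to list them against the strata in a different order.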

Additional Driver Information